Abstract



One strategy for discovering the connections between social policy interventions and behavioral 
outcomes is to conduct social experiments that use random assignment research designs. Al- 
though random assignment experiments provide reliable estimates of the effects of a particular 
policy, they do not reveal how a policy brings about its effects. If policymakers had answers to 
the “how” questions, they could design more effective interventions and make more informed 
policy trade-offs. This paper reviews one promising approach to specifying the causal paths by 
which impacts are expected to occur: instrumental variables analysis, a method of estimating the 
effects of intervening variables — also called mediating variables, or mediators — that link in- 
terventions and outcomes. It explores the feasibility of applying this approach to data from ran- 
dom assignment designs, reviews the policy questions that can be answered using the approach, 
and outlines the conditions that have to be met for the effects of mediating variables to be esti- 
mated. Illustrations of instrumental variables analysis based on data from random assignment 
studies are also presented. 
Introduction



Because so many factors influence human behavior, making a clear link between be- 
havioral influences and behavioral outcomes is anything but straightforward. One strategy for 
making such links is to conduct social experiments that use random assignment research de- 
signs to answer questions about the effects of social policy interventions. The use of random 
assignment in such experiments eliminates most common sources of bias from these estimates, 
producing findings that are largely undisputed and easy to interpret (see Robins and Greenberg, 
1986; Orr, 1999, for a review). Because individuals are assigned at random to treatment and control
groups, any systematic difference in outcomes between the two groups can be attributed to the treatment.

In principle, random assignment experiments can be designed to answer any social pol- 
icy question. In practice, however, random assignment experiments have important limitations. 
First, these experiments require a well-controlled and well-defined “counterfactual” state. (The 
counterfactual is the condition that would have existed in the absence of the policy intervention 
or program.) This counterfactual state, to which control group members in the experiment are 
assigned, determines and limits which policy questions the random assignment experiment can 
answer, as the effect of the intervention is always determined relative to this counterfactual. 
Second, the policy question being studied has to be “assignable,” meaning that enrollment in a 
program or exposure to a “treatment” is manipulated through an external mechanism that is part 
of the research design. Often, policy questions cannot be manipulated that way. Several re- 
searchers, for example, have attempted to measure the effects of receiving a General Educa- 
tional Development certificate (GED) on those who earn this alternative high school credential 
(Bos et al., 2001; Cameron and Heckman, 1998; Tyler, Murnane, and Willett, 2001), but since a
GED is earned, and cannot be assigned by researchers, all of these studies relied on nonexperi- 
mental comparisons of GED-holders and others. Third, nonparticipation and varying levels of 
participation in a treatment can affect the estimates of the effect of the treatment. A recent study, 
for example, examined the relationship between the number of months that welfare recipients 
spent in adult education programs and their subsequent employment outcomes (Bos et al., 
2001). While welfare recipients in this study were randomly assigned to an “adult education” 
program stream, many did not participate, and those who did participate received education ser- 
vices for widely varying amounts of time. Both of these decisions were beyond the control of the 
researchers despite the underlying experimental research design. As a result, the estimates of the 
effects of the program were attenuated due to nonparticipation, and estimates of the relationship 
between the amount of participation and subsequent outcomes were strictly nonexperimental. 

Finally, experimental designs do a very good job of estimating the effects of a particular
policy, but they do not reveal how those effects occurred. Yet many researchers are interested
in how programs that are evaluated using random assignment designs achieve their effects
on those assigned to them. Answering these “how” questions can help policymakers design
more effective interventions and can help them make difficult policy trade-offs. For example, 
recent evaluations of welfare and work programs have found that some of these programs may 
improve school outcomes for children in elementary school (Morris et al., 2001). However, an 
open question is whether these effects are driven by the increased financial and other resources 
available to program participants (since increases in family income are believed to benefit chil- 
dren) or by the increased employment among their parents (since working parents function as 
role models for their children, possibly increasing children’s motivation to do well in school, 
and employment may also enhance the regularity of family routines). Knowing how these com- 
ponents work together to produce the desired effects could inform funding decisions that trade 
off resources spent on services such as child care and case management and resources chan- 
neled directly to low-income families. These “how” questions about the overall program effects 
are often considered part of a “black box” of program effects, a term that reflects the common 
perception that these questions are essentially unanswerable or at least very difficult to address. 

This chapter reviews one promising approach for clearly specifying the causal paths by
which impacts are expected to occur: a method of analysis that can estimate the effects of
intervening variables, also called mediating variables or “mediators.” Subsequent sections
explore the feasibility of this approach using data from random assignment designs, review the
policy questions that can be answered using such an approach, and introduce the conditions that
have to be met to estimate the effects of mediating variables.



Instrumental Variables Analysis as a Nonexperimental Alternative 
to Understanding Program Impacts 

Nonexperimental research methods can be used to address policy questions that are not 
readily tested using random assignment or are concerned with how programs produce their ef- 
fects. Such methods include cross-sectional comparisons of outcomes across different levels of 
a policy variable in a sample, longitudinal analyses of changes in those outcomes over time, or 
combinations of the two. When examining the causal effect of one variable, researchers face the 
need to isolate the particular variable of interest from other important variables that are corre- 
lated with it. Thus, going back to an earlier example, in comparing the outcomes of GED hold- 
ers with those of GED nonholders, they need to account for social background variables that 
may cause some people to pursue such a credential while others do not. In the following model,
Yi is the outcome, the background variables are shown as Zik, and the policy variable of interest
is labeled Xi.^1

$$Y_i = \alpha + \beta_1 X_i + \sum_k \gamma_k Z_{ik} + \varepsilon_i \qquad (1)$$

Including the background variables Zik is intended to account for the fact that GED- 
holders and those without a GED have different background characteristics, and controlling for 
them should improve estimates of the causal effect of having a GED. However, though theory 
can help determine which background variables should be included in an empirical estimation, 
some of these background variables are unobservable or difficult to measure, such as motiva- 
tion. Consequently, it is virtually impossible to have all the information necessary — that is, all 
the “theoretically-motivated” background variables — available to include in the empirical es- 
timation. This is a challenge that faces anyone who conducts nonexperimental research. It is 
always possible that a key explanatory variable Zik is left out of the analysis. If that is the case, its
effect on the outcome may be misattributed to the policy variable Xi.
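
The misattribution described above can be illustrated with a small simulation. This is a minimal sketch under assumed values, not an analysis from the paper: the effect sizes, the sample size, and the helper `ols_slope` are all hypothetical, and the unmeasured variable (for example, motivation) is generated so that it raises both the policy variable and the outcome.

```python
import random

def ols_slope(x, y):
    """Slope from a least-squares fit of y on x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

rng = random.Random(0)          # fixed seed so the sketch is reproducible
n = 20000
true_effect = 1.0               # assumed causal effect of X on Y

z = [rng.gauss(0, 1) for _ in range(n)]                  # unmeasured trait Z
x = [0.8 * z_i + rng.gauss(0, 1) for z_i in z]           # X depends on Z
y = [true_effect * x_i + 2.0 * z_i + rng.gauss(0, 1)
     for x_i, z_i in zip(x, z)]                          # Y depends on X and Z

# Z left out: part of its effect on Y is misattributed to X.
naive = ols_slope(x, y)

# Z available: remove the part of X explained by Z, then regress Y on the rest.
b_xz = ols_slope(z, x)
x_resid = [x_i - b_xz * z_i for x_i, z_i in zip(x, z)]
adjusted = ols_slope(x_resid, y)

print(f"omitting Z:        {naive:.2f}")
print(f"controlling for Z: {adjusted:.2f}")
```

With these assumed parameters the slope that omits Z is close to twice the assumed effect, while the adjusted slope is close to it. In real nonexperimental data the second calculation is not available, because Z is by definition unmeasured.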

Many books and articles have been written about the limitations of nonexperimental re- 
search methods, most of which have to do with researchers’ inability to take account of all al- 
ternative explanations for the empirical relationships they observe in nonexperimental data. 
These limitations affect both types of nonexperimental research being conducted. In cross- 
sectional research, failure to account for important alternative explanations is commonly known 
as “selection bias,” and in longitudinal research such a failure creates “history” or “maturation” 
bias (see Cook and Campbell, 1979, for an extensive discussion of these limitations of nonex- 
perimental research). 

Two distinctive techniques, “differencing” with data from natural experiments and
instrumental variables analysis, represent common approaches to capitalizing on exogenous
variation in policy variables to control for these biases in understanding empirical relationships.

Natural Experiments 

So-called “natural experiments” are the most common, and arguably the most intuitive, 
technique. In a natural experiment, researchers take advantage of a situation in which two
otherwise identical groups (or time periods) are affected differently by a “natural” event Pi that is
exogenous to the relationship between Yi and Xi and causes a sufficiently large change in both.



^1 Xi is called a policy variable throughout this paper because it captures a well-understood construct that
policymakers might want to manipulate, such as educational attainment, child care use, or family income. However,
in almost every case discussed here, it is not feasible to apply random assignment to Xi itself.
In a famous example in the minimum wage literature, Card and Krueger (1994) compared the
labor demand of fast food restaurants in the border region of Pennsylvania and New Jersey after 
the minimum wage was raised in New Jersey. Contrary to the prevailing theory, they found no 
evidence that this policy decision reduced the demand for fast food labor. Their results were 
convincing because economic and demographic conditions within their cross-state sample were 
virtually identical except for the change in the minimum wage law.^2 The presence of a state
border and the difference in state policy created a natural experiment to test the effect of wage
policy on the demand for low-wage labor.

There are numerous examples of research that uses such “natural” variation in policy 
variables to produce unbiased estimates of the effects of those policy variables on human behav- 
ior and economic conditions. For example, Hotz et al. (1999) estimated the effects of teen child- 
bearing by comparing outcomes for teens who had miscarriages with outcomes for teens who 
carried their children to term. Hoxby (2001) estimated the effects of vouchers on school choice 
using school district boundaries determined by streams. And, recently, Angrist (forthcoming, 
2002) estimated the effects of sex ratios on marriage and labor market outcomes using immigra- 
tion inflows into the United States. 



Instrumental Variables Estimation 



Another powerful nonexperimental technique for addressing questions of causality is 
the use of instrumental variables (IV) estimation strategies. These strategies rely on finding an 
independent (“exogenous”) source of random variation in the policy variable Xi whose effects 
are being analyzed. This exogenous variable (hereafter referred to as Pi) is known as the
“instrument.” In a simple instrumental variables framework, the effect of Xi on outcome Yi is
estimated by comparing the effect of Pi on Yi to the effect of Pi on Xi, or more explicitly:



$$\frac{dY_i}{dX_i} = \frac{dY_i / dP_i}{dX_i / dP_i}, \qquad (2)$$

where dYi/dXi is the effect of Xi on Yi, dYi/dPi is the effect of Pi on Yi, and dXi/dPi is the
effect of Pi on Xi.
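
For a binary instrument, the ratio in Equation 2 can be computed from four group means. The sketch below is illustrative only: the data are simulated, the effect sizes are assumptions, and the names `group_mean` and `wald_estimate` are hypothetical helpers rather than anything defined in the paper.

```python
import random

def group_mean(values, p, group):
    """Mean of values among units whose assignment equals group."""
    picked = [v for v, p_i in zip(values, p) if p_i == group]
    return sum(picked) / len(picked)

def wald_estimate(y, x, p):
    """Equation 2 for a binary P: (effect of P on Y) / (effect of P on X)."""
    effect_on_y = group_mean(y, p, 1) - group_mean(y, p, 0)
    effect_on_x = group_mean(x, p, 1) - group_mean(x, p, 0)
    return effect_on_y / effect_on_x

rng = random.Random(42)
n = 20000
p = [rng.randint(0, 1) for _ in range(n)]       # randomly assigned instrument
u = [rng.gauss(0, 1) for _ in range(n)]         # unmeasured confounder
# Assumed first-stage effect of P on X is 2.0; assumed effect of X on Y is 1.5.
x = [2.0 * p_i + u_i + rng.gauss(0, 1) for p_i, u_i in zip(p, u)]
y = [1.5 * x_i + 3.0 * u_i + rng.gauss(0, 1) for x_i, u_i in zip(x, u)]

iv_effect = wald_estimate(y, x, p)
print(f"ratio of the two program effects: {iv_effect:.2f}")
```

Because P is assigned independently of the confounder, the ratio lands near the assumed effect of 1.5 even though the confounder is never observed.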



^2 While there was little criticism of the underlying research design, some researchers did question other
aspects of this study, such as data collection and sample selection (Wascher and Neumark, 1992, 1994).
Limitations of Nonrandomized Instruments



These two techniques illustrate that to produce valid findings, the variation in the policy 
variable Xi does not have to be random as in a real experiment provided that this variation is 
exogenous to outcome Yi. More formally, this is expressed as follows: 



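$$Y_i = \alpha + \beta_2 \hat{X}_i + \sum_k \gamma_k Z_{ik} + \varepsilon_i, \quad \text{where} \qquad (3)$$

$$X_i = \pi + \delta P_i + \mu_i, \qquad (4)$$

$$\mathrm{COV}(P_i, Z_{ik}) = 0, \text{ and} \qquad (5)$$

$$|\mathrm{COV}(P_i, X_i)| > 0, \text{ and} \qquad (6)$$

$$\mathrm{COV}(Y_i, P_i \mid X_i) = 0. \qquad (7)$$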



Remember, in such a system of equations, Pi is the “instrument” for Xi. The equation in
which the instrument is used to predict Xi is called the first-stage equation (with the
second-stage equation being that in which the predicted value of Xi is used to predict Yi). Because
not all possible variables Zik are known, one cannot be certain that COV(Pi, Zik) is indeed zero
unless Pi is randomly assigned.
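
The two stages described above can be sketched in a few lines. This is a hedged illustration with simulated data and assumed coefficients, with background variables omitted for brevity; `ols` is a hypothetical helper, not a function from any statistics package.

```python
import random

def ols(x, y):
    """Return (intercept, slope) of a least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

rng = random.Random(1)
n = 20000
p = [rng.randint(0, 1) for _ in range(n)]       # randomly assigned instrument
u = [rng.gauss(0, 1) for _ in range(n)]         # unmeasured confounder
x = [1.0 + 1.0 * p_i + u_i + rng.gauss(0, 1) for p_i, u_i in zip(p, u)]
# Assumed effect of X on Y is 0.7; the confounder pushes Y the other way.
y = [2.0 + 0.7 * x_i - 1.5 * u_i + rng.gauss(0, 1) for x_i, u_i in zip(x, u)]

_, naive_slope = ols(x, y)                      # ignores the confounder

pi_hat, delta_hat = ols(p, x)                   # first stage: P predicts X
x_hat = [pi_hat + delta_hat * p_i for p_i in p]
alpha_hat, beta2_hat = ols(x_hat, y)            # second stage: predicted X predicts Y

_, reduced_form = ols(p, y)                     # effect of P on Y
print(f"naive slope:        {naive_slope:.2f}")
print(f"two-stage estimate: {beta2_hat:.2f}")
print(f"ratio form:         {reduced_form / delta_hat:.2f}")
```

With a single instrument and no covariates, the two-stage estimate equals the ratio of the effect of P on Y to the effect of P on X. Standard errors taken from a manually run second stage are not valid, so applied work would normally rely on a dedicated IV routine.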

This observation highlights one of the limitations of natural experiments. Although 
natural experiments add valuable independent variation to analyses of the effects of policy
variables on outcomes, it is almost never possible to guarantee that Pi is uncorrelated with all
unmeasured variables Zik. For example, in cases where “natural” policy variation across
jurisdictional boundaries is studied, researchers must assume that people are randomly distributed on
different sides of the boundary or at least that their choice to live on one side or the other is un- 
correlated with the policy variable X. Given that many of these policy variables are affected by 
systematic preferences or differential ability to choose where to live (e.g., some people may not 
have the economic means to live on one side of the boundary), it is often difficult to make a 
convincing argument that Pi is indeed uncorrelated with Zik. Moreover, it is often very difficult to
find natural experiments to answer critical policy questions. Using natural experiments requires 
a great deal of opportunism on the part of researchers, which often leads to compromises in 
other aspects of the research, such as the study’s generalizability, the quality of the available 
data, or the researchers’ ability to measure Pi with precision. The most highly regarded
examples of “natural” experiments in policy research are those that use true lotteries that are
implemented for nonresearch reasons. In the labor economics literature, many studies rely, for
example, on the Vietnam era draft lottery, which constituted a strong exogenous incentive to change
one’s behavior for many of those affected (e.g., enroll in college, have a family, volunteer for the
army, or move to Canada). Studies based on this lottery were used to estimate the economic bene- 
fits of postsecondary education (Angrist and Krueger, 1992). However, drastic lottery-based 
“treatments” like the Vietnam era draft lottery are very rare, which means that researchers who 
rely on natural experiments rarely have such convincing instruments to work with. 

Random Assignment as an Instrument 

Since a variable is only a “good” instrument if it is known a priori to be uncorrelated 
with any unmeasured explanatory variables Zik, the best instruments are those whose values are 
assigned randomly. Randomized experiments are designed to create a “program” variable that 
has randomly assigned values. Provided that a program variable has a meaningful relationship 
with a policy variable of interest, it is a natural choice of instrument for this policy variable. 
Thus, for example, if researchers studying the effects of the GED on earnings could identify an 
experimental treatment that affected GED receipt (such as a program that promoted taking the 
GED test among high school dropouts with sufficient skills to pass it), they could, in theory, use 
the experimental treatment variable as an instrument for GED receipt. In the remainder of this 
paper, we will explore the assumptions under which such an approach would be valid, its limita- 
tions in terms of statistical power and data requirements, and the potential to develop studies 
that would use random assignment expressly as a way to generate valid and powerful instru- 
ments to address important policy questions. 



Policy Questions Answered by IV and the Assumptions Needed 
to Answer These Questions 

What Policy Questions Can IV Answer When Combined with Experimental Designs?

Using IV with random assignment allows researchers to answer a wider range of policy 
questions than are answered by random assignment studies alone. This section describes the 
range of questions that can be addressed when random assignment studies are combined with 
IV estimation strategies. Moreover, this section demonstrates how instrumental variables
estimators can help make the answers from random assignment studies speak more directly to the
policy questions that researchers are ultimately trying to address (see Angrist, Imbens, and
Rubin, 1996, for further discussion). For definitions of the key terms used in this section, see
the “Definition of Terms” text box.
An “Intention” to Treat



When a random assignment experiment is conducted, the difference in outcomes be- 
tween those assigned to the program group and those assigned to the control group is a fully 
experimental (and valid) estimate of the program effect. However, in many cases, the “pro- 
gram” whose effect is captured is not equivalent to the policy being studied. For example, sup- 
pose a training program aims to provide 30 weeks of vocational skills training to those who en- 
roll. A sample of applicants for the program is recruited, random assignment is conducted, and 
half of the sample is offered the program. Many of those randomly assigned to the program 
group follow through and indeed participate for 30 weeks. However, some drop out, and others 
do not show up at all. As a result, the program-control group difference at the end of the study 
does not constitute the effect of 30 weeks of vocational skills training. Instead, it captures the ef- 
fect of a program’s intention to provide such a level of training. Among economists, this effect is 
known as the “intent-to-treat” (or ITT) effect. The relevance of the ITT estimate for policymakers 
depends on the rate of take-up in the program and the initial policy question. Critics of random 
assignment studies often point to the limitations of ITT estimates as a major drawback of answer- 
ing policy questions with random assignment research (e.g., see Heckman, 1997). 

The Effect of Receiving the Treatment 

Many of these critics argue that a more relevant measure is the effect of “treatment on 
the treated” (TOT). In the example above, this measure would capture the effect of actually re- 
ceiving the 30 scheduled weeks of training (the “treatment” in the context of that example). It 
answers the question of how those who received the training benefited from their experience. 
However, there are two serious problems with the TOT effect. First, it is very difficult to esti- 
mate, because it is difficult to establish a priori who is going to be among the “treated.” (If that 
were possible, the experiment could just be limited to that group.) Second, even if it were possi- 
ble to reliably estimate a TOT effect, it would be difficult to generalize it to a wider policy. 
Knowing, for example, that 30 weeks of vocational skills training produces an X percent in- 
crease in the earnings of those who received it does not answer the question of how to go about 
getting those who need higher earnings to participate in such training. 

Instrumental variables estimation strategies can be used to approximate TOT effects. 
The coefficient β2 in Equation 3 is often considered an instrumental variables estimator of the
TOT effect of Xi on Yi. In that example, the instrument Pi is the program (or the “intention to
treat”) and Xi is the “treatment.” However, as Angrist et al. (1996) point out, the effect β2 is not
a true TOT effect but a special case of such an effect, which they define as the “local average 
treatment effect” (LATE). Whereas TOT is the general effect of a treatment on those who re- 
ceive it, LATE is defined more precisely as the effect of a “treatment” on those who are induced 
by a specific program to receive it.^3 In experimental research, these individuals are also known
as “compliers” (sample members who would not have received the “treatment” had they been 
assigned to the control group; Angrist et al., 1996). A key difference between LATE and TOT is 
that the latter applies to a clearly defined and identifiable subpopulation, namely, those who re- 
ceive the treatment (Heckman, 1996a, 1996b). (In the example above, these are all the sample 
members for whom Xi = 1, regardless of the value of Pi.) The LATE estimator is limited to
compliers, who cannot be identified either a priori or ex post facto in any experiment. 
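
The distinction among ITT, LATE, and TOT can be made concrete with simulated data in which each person's compliance type is known, something that is never true in a real experiment. Everything below is an assumption made for illustration: the shares of compliers, always-takers, and never-takers, the effect sizes in `EFFECT`, and the absence of defiers.

```python
import random

rng = random.Random(7)
n = 40000
# Assumed individual effects of receiving the treatment X on the outcome Y.
EFFECT = {"complier": 2.0, "always": 5.0, "never": 1.0}

rows = []
for _ in range(n):
    r = rng.random()
    kind = "complier" if r < 0.5 else ("always" if r < 0.7 else "never")
    p = 1 if rng.random() < 0.5 else 0          # random assignment
    # Always-takers get X regardless; compliers get X only if assigned.
    x = 1 if kind == "always" or (kind == "complier" and p == 1) else 0
    y = EFFECT[kind] * x + rng.gauss(0, 1)
    rows.append((kind, p, x, y))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# ITT: program-control difference in outcomes.
itt = (mean(y for _, p, _, y in rows if p == 1)
       - mean(y for _, p, _, y in rows if p == 0))
# Program effect on receipt of the treatment.
take_up = (mean(x for _, p, x, _ in rows if p == 1)
           - mean(x for _, p, x, _ in rows if p == 0))
late_hat = itt / take_up                        # IV estimate

# True average effect among treated program group members (known only here).
tot_true = mean(EFFECT[kind] for kind, p, x, _ in rows if p == 1 and x == 1)

print(f"ITT: {itt:.2f}  LATE estimate: {late_hat:.2f}  true TOT: {tot_true:.2f}")
```

Under these assumptions the IV ratio is close to the compliers' effect of 2.0, not to the larger average effect among everyone who is treated, because always-takers receive the treatment in both research groups and so contribute nothing to the program-control difference.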



Definition of Terms 

ITT (Intent to Treat). ITT captures the effect of a program’s intention to provide a 
certain level of services or benefits. 

TOT (Treatment on the Treated). TOT captures the effect of a program on those 
who received it. 

LATE (Local Average Treatment Effect). LATE captures the effects of a program 
on those who were induced to receive it, that is, compliers. 

ATE (Average Treatment Effect). ATE generalizes LATE to a broader population of
compliers.



Uses of the Local Average Treatment Effect 

The relevance of LATE estimators is hotly debated. In support of LATE estimators, one 
might argue that it is especially helpful to understand the effects of a policy variable on those for 
whom this variable can be manipulated through an external program (as opposed to those for 
whom such manipulation makes no difference). However, the usefulness of LATE estimators is 
limited by researchers’ inability to understand or control the process of compliance. As Robins 
and Greenland (1996) argue, if an experiment shows that a treatment Xi is beneficial, making Xi
widely available in the population will result in a different take-up pattern than that found in the
experiment that produced the LATE estimator of Xi’s effect. For example, many more people
may take aspirin after a study’s results show it to be effective in treating heart attacks. There- 
fore, there will probably be more — and different — compliers. The actual effect that would 
take this into account, which Robins and Greenland refer to as the “average treatment effect” 



^3 Note that the term “treatment” is easily confused with the experimental status. In discussions of TOT and
LATE effects, the “treatment” is the policy variable referred to earlier as Xi, not the experimentally
manipulated program variable Pi.






(ATE), cannot be estimated directly, but they demonstrate that it is possible to develop bounds 
for this effect, given valid estimates of the ITT and LATE effects.^4

Given a number of strict assumptions, which will be discussed below, Angrist et al. 
(1996) show that the instrumental variables estimator of β2 is a valid LATE estimator of the effect
of Xi on Yi. It is not a valid TOT estimator of Xi on Yi, unless everyone who receives the
treatment Xi does so by way of program Pi. But, as discussed above, LATE estimators are policy
relevant because they capture the effects of externally induced changes in policy variables, 
which is the purpose of much social policy research. The following section describes these as- 
sumptions and discusses both the likelihood that they are violated in real-life situations and the 
consequences such violations would have for the validity of estimated effects. 

Assumptions to Identify Causal Effects 

Consider again the instrumental variables estimator presented in Equations 2 and 3. There 
is a randomized program variable Pi, which produces a change of δ in the policy variable Xi. The
change in outcome Yi associated with this randomly induced change in Xi is captured by the
instrumental variables estimator of β2, which is a valid LATE estimator of the effect of Xi on Yi if the
following five assumptions are met (Angrist et al., 1996; for a list, see the “Key Assumptions for
Identifying an IV Model” text box).

First, the instrument Pi is assumed to be a randomly assigned variable, meaning that it is 
uncorrelated with demographic characteristics of persons i, with preprogram levels of outcome
Yi, and with any other preprogram variables Zik that could predict Yi. As mentioned above,
instead of being randomly assigned, it would be sufficient for Pi to be a nonrandom but truly
exogenous variable, provided that it were possible to demonstrate that it was.

Second, there must be a meaningful effect of the instrument Pi on the policy variable Xi, 
i.e., δ ≠ 0. In later sections of this chapter, it will become clear that the larger the program effect
δ is, the more reliable the instrumental variables estimator will be. When δ is small, Pi is
said to be a “weak” instrument.
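
The link between the size of the program effect on Xi and the reliability of the estimator can be seen by repeating a simulated experiment many times with a strong and a weak first stage. The sketch below is illustrative only: the first-stage values of 1.0 and 0.1, the sample size, the number of replications, and the helper names are all assumptions.

```python
import random
import statistics

def wald_estimate(y, x, p):
    """(Effect of P on Y) divided by (effect of P on X) for a binary P."""
    def diff(values):
        g1 = [v for v, p_i in zip(values, p) if p_i == 1]
        g0 = [v for v, p_i in zip(values, p) if p_i == 0]
        return sum(g1) / len(g1) - sum(g0) / len(g0)
    return diff(y) / diff(x)

def simulate_once(rng, delta, n=1000, beta=1.0):
    """One simulated experiment with first-stage effect delta."""
    p = [rng.randint(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]     # unmeasured confounder
    x = [delta * p_i + u_i + rng.gauss(0, 1) for p_i, u_i in zip(p, u)]
    y = [beta * x_i + 2.0 * u_i + rng.gauss(0, 1) for x_i, u_i in zip(x, u)]
    return wald_estimate(y, x, p)

def median_abs_error(delta, reps=100, beta=1.0, seed=3):
    """Typical distance between the IV estimate and the assumed effect."""
    rng = random.Random(seed)
    errors = [abs(simulate_once(rng, delta, beta=beta) - beta)
              for _ in range(reps)]
    return statistics.median(errors)

mae_strong = median_abs_error(delta=1.0)
mae_weak = median_abs_error(delta=0.1)
print(f"strong instrument, typical error: {mae_strong:.2f}")
print(f"weak instrument, typical error:   {mae_weak:.2f}")
```

The median absolute error is used instead of the standard deviation because, with a weak instrument, a near-zero denominator occasionally produces extreme estimates that would dominate a variance calculation.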

The third assumption is referred to by Angrist et al. (1996) as the “stable unit treatment 
value assumption” (SUTVA). This assumption requires that the values of policy variable X and 
the relationships between those policy values and outcome values Y across individuals i are 
“stable” for those individuals, that is, unaffected by variation in X or Y for other individuals.
Without this assumption, it is impossible to draw reliable inferences about the effects of the
policy variable on the outcome, regardless of the type of statistical analysis that is used. In practice,



^4 Robins and Greenland (1996) present an extensive review of the biomedical literature to support
these bounds.



this means that there are assumed to be no community effects or displacement effects; Y, X, and
P are independent across different individuals i. 

A violation of SUTVA creates a bias in the estimator of β2. The size and direction of this
bias depend on the nature and seriousness of the violation. For example, consider the hypotheti- 
cal training program introduced earlier. Say there is a limited number of skilled positions in a 
geographic area and program P dramatically increases the number of persons holding a training 
credential. As a result, employers are less willing to pay high wages to retain credentialed per- 
sons, that is, the premium associated with a credential is reduced. In this case, the increased 
level of educational attainment (Xi) in the community, associated with program P, has lowered
earnings for those who would have held a credential even without program P. As a result, the 
estimated effect is an underestimate of the true benefit associated with receipt of a creden- 
tial. Fortunately, it is reasonable to assume that SUTVA holds in most cases as the size of pro- 
grams tends to be small relative to the communities in which they are implemented. 

A fourth assumption is that the program effect δ on the policy variable (i.e., the effect
of Pi on Xi) is monotonic. This means that when δ > 0, (Xi|Pi=1) is greater than or equal to
(Xi|Pi=0) for every person i. Thus, for example, if Pi is a training program and Xi is the amount
of training received, no one randomly assigned to the program receives less training than s/he 
would have received if s/he had not been assigned to the program. The consequences of violat- 
ing this assumption depend on the relationship between Xi and Yi for compliers and “defiers” 
(defined as individuals who would have received the “treatment” if they had been assigned to 
the control group but do not receive it when assigned to the program group). Unless this
relationship varies across these two groups, that is, unless the effect of training (Xi) on earnings
(Yi) in this example is different for compliers than for defiers, there is no bias in the estimator
of β2. However, in reality it is impossible to prove that a decision to defy is exogenous to the
relationship between Xi and Yi, so any violation of the monotonicity assumption is potentially serious.
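
The role of defiers can be written out for the simplest case of a binary treatment. The expression below is a sketch under stated assumptions (random assignment, the exclusion restriction, and SUTVA all hold), and the symbols for group shares and group-specific effects are introduced here for illustration; they do not appear in the paper.

```latex
% Assumed notation: X_i is binary;
% \pi_c, \pi_d = population shares of compliers and defiers (\pi_c > \pi_d);
% \tau_c, \tau_d = average effect of X on Y among compliers and defiers.
\[
\frac{E[Y_i \mid P_i = 1] - E[Y_i \mid P_i = 0]}
     {E[X_i \mid P_i = 1] - E[X_i \mid P_i = 0]}
= \frac{\pi_c \tau_c - \pi_d \tau_d}{\pi_c - \pi_d}
= \tau_c + \frac{\pi_d}{\pi_c - \pi_d}\,(\tau_c - \tau_d)
\]
% Worked example: \pi_c = 0.4, \pi_d = 0.1, \tau_c = 2, \tau_d = 0.5 gives
% (0.8 - 0.05) / 0.3 = 2.5, which overstates the compliers' effect of 2.
% If \tau_c = \tau_d, the second term vanishes and there is no bias.
```

The second term shows why the bias depends both on how many defiers there are and on how different their response is from that of compliers.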

The fifth assumption is known as the exclusion restriction. It states that any effect of the 
program Pi on the outcome Yi must be mediated by the policy variable Xi for Pi to be a valid
instrument for Xi. When there is an effect of Pi on Yi that is not mediated by Xi, the instrumental
variables estimator may misattribute this effect to Xi, causing the estimate of β2 to be biased. The
closer program Pi and policy variable Xi are conceptually, the less likely such a violation of the
exclusion restriction is to occur, and the less severe the bias will be when it does. (Recall
that the causal inference about the policy effect would be strongest if Xi itself were subject to
random assignment, in which case Pi = Xi.)
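
A violation of the exclusion restriction can be sketched by giving the program a direct effect on the outcome that bypasses Xi. In the simulation below, all values are assumptions chosen for illustration, and `simulate` and `wald_estimate` are hypothetical helpers. With a linear model of this kind, the IV ratio converges to the assumed effect plus the direct effect divided by the first-stage effect.

```python
import random

def wald_estimate(y, x, p):
    """(Effect of P on Y) divided by (effect of P on X) for a binary P."""
    def diff(values):
        g1 = [v for v, p_i in zip(values, p) if p_i == 1]
        g0 = [v for v, p_i in zip(values, p) if p_i == 0]
        return sum(g1) / len(g1) - sum(g0) / len(g0)
    return diff(y) / diff(x)

def simulate(delta, direct_effect, n=40000, beta=1.0, seed=11):
    """IV estimate when P also affects Y directly by direct_effect."""
    rng = random.Random(seed)
    p = [rng.randint(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]     # unmeasured confounder
    x = [delta * p_i + u_i + rng.gauss(0, 1) for p_i, u_i in zip(p, u)]
    y = [beta * x_i + direct_effect * p_i + 2.0 * u_i + rng.gauss(0, 1)
         for x_i, p_i, u_i in zip(x, p, u)]
    return wald_estimate(y, x, p)

valid = simulate(delta=0.5, direct_effect=0.0)            # restriction holds
violated_small = simulate(delta=0.5, direct_effect=0.3)   # limit: 1.0 + 0.3 / 0.5
violated_large = simulate(delta=2.0, direct_effect=0.3)   # limit: 1.0 + 0.3 / 2.0

print(f"no direct effect:                  {valid:.2f}")
print(f"direct effect, small first stage:  {violated_small:.2f}")
print(f"direct effect, large first stage:  {violated_large:.2f}")
```

The same direct effect does far more damage when the program moves Xi only a little, which is one reason a strong first stage matters beyond statistical precision.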

In summary, the first three of these five assumptions are fairly easily met in most social 
experiments, provided that the experiment has a sufficiently strong effect on the policy variable, 
random assignment is carried out well, and the scale of the experiment is small enough to
preclude significant neighborhood or displacement effects. However, the monotonicity assumption
and the exclusion restriction require considerable scrutiny if a social experiment is to be used as
the basis for an instrumental variables analysis. 



Key Assumptions for Identifying an IV Model 
with Data from a Random Assignment Design (Calculating LATE) 

Instrument is Exogenous. The instrument, Pi, is randomly assigned and therefore uncorrelated with
unobserved characteristics of persons i, including preprogram versions of outcome Yi.

Instrument has a meaningful effect. The instrument, Pi, must be a reliable predictor
of the policy variable, Xi.

SUTVA (stable unit treatment value assumption). The value of the policy variable, Xi, and the
relationship between it and the outcome, Yi, are not affected by variation in X and Y for
others.

Monotonicity. The effect of the instrument, Pi, on the policy variable, Xi, runs in the same
direction for every person: when the program effect is positive, no one's value of Xi under the
program is smaller than the value that would otherwise occur.

Exclusion restriction. Any effect of the instrument, Pi, on the outcome, Yi, must be
mediated by the policy variable, Xi, conditional on observed characteristics.



Consequences of Defiance 

To address the monotonicity assumption, it is necessary that researchers have a clear 
understanding of the direction of the expected program effects δ (of Pi on Xi) and the extent
and nature of defiance among study participants in the experiment. 

To illustrate the potential effect of defiance on instrumental variables estimates from a 
social experiment, consider the possibility of using a multigroup random assignment evaluation 
of welfare-to-work programs to study the effects of maternal employment on child outcomes. In 
the National Evaluation of Welfare-to-Work Strategies (NEWWS), funded by the U.S.
Department of Health and Human Services and conducted by MDRC, sample members in three of the
seven sites were randomly assigned to one of three research groups: a labor force attachment 
(LFA) group; a human capital development (HCD) group; or a control group. All the sample 
members were welfare recipients at the time of random assignment. LFA group members were 
offered services to accelerate their entry into employment, and HCD group members were of- 
fered services aimed at increasing their education (see Hamilton et al., 2002, for more details on 
this study). Both programs produced significant positive effects on employment and earnings 
over time, although increases were significantly larger for the LFA program. Given these program 
effects, it would be tempting to use the NEWWS data to study the effects of increased 
gram effects, it would be tempting to use the NEWWS data to study the effects of increased 
employment among welfare recipients on a host of other outcomes, such as the well-being of 
their families or their children’s education outcomes. However, using the LFA and HCD pro- 
gram variables as instruments for increased employment is problematic, especially for people 
assigned to the HCD programs, because doing so requires assuming that the expected value of 
program effect δi is positive. But many HCD program group members may have reduced their 
employment, at least initially, to participate in education and training offered through the pro- 
gram. In the context of the employment-based instrumental variables analysis, these participants 
would be considered defiers because, even though they met the program’s requirements, they 
reduced their employment more than they would have in the control group. It is impossible to 
know whether employment would have similar effects on nonemployment outcomes for this 
group as for compliers, that is, people who increased their employment immediately rather than 
returning to school. Consequently, it would be preferable to limit this hypothetical instrumental 
variables analysis to the LFA data, where the effect δi is less ambiguous and defiance (in terms 
of the effect of the program on employment) is much less likely. Alternatively, as described 
later in the paper, the HCD data could be used to answer questions about the effects of maternal 
education on children’s outcomes. 
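
The consequence of defiance can be made concrete with a small calculation. For a binary instrument and a binary policy variable, a standard result is that the Wald ratio converges to the compliers' effect weighted by their share, minus the defiers' effect weighted by theirs, divided by the net first-stage effect. The sketch below applies that formula; the function name and all shares and effect sizes are hypothetical and are not estimates from NEWWS.

```python
def wald_with_defiers(share_compliers, effect_compliers, share_defiers, effect_defiers):
    """Value the Wald ratio converges to when both compliers and defiers are present."""
    impact_on_outcome = share_compliers * effect_compliers - share_defiers * effect_defiers
    impact_on_policy = share_compliers - share_defiers
    return impact_on_outcome / impact_on_policy


if __name__ == "__main__":
    # No defiers: the ratio recovers the compliers' effect.
    print(wald_with_defiers(0.30, 1.0, 0.00, 3.0))
    # Ten percent defiers with a larger effect: the ratio is pulled toward zero,
    # even though employment helps everyone in this hypothetical population.
    print(wald_with_defiers(0.30, 1.0, 0.10, 3.0))
```

The second case shows why the direction of the first-stage effect must be unambiguous: the estimate can be far from the effect for either group when defiers are numerous and respond differently.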

Multiple Pathways 

The exclusion restriction is probably the strongest of the five assumptions and is the 
most difficult to verify conclusively. In one of the examples mentioned above, Pi was assignment 
to a training program and Xi was defined as receipt of training. In that case, Pi and Xi were 
closely related and most of the effect of Pi on Yi was likely to be mediated by Xi. However, 
many training programs provide services other than training to their students, such as job search 
assistance and referrals to other education providers. Unless these services are somehow controlled 
for in the analysis, the benefits from these auxiliary services may be attributed to the 
training variable Xi and become part of the estimated effect β. In the case of this hypothetical 
training program, the consequences of this particular bias may not be severe, because it essentially 
redefines the treatment to include the other program components together with the training. 
However, when programs are more comprehensive and multifaceted, it becomes very difficult 
to separate the effects of one program component from the effects of another. 

Examining More Than One Causal Path: Multiple Mediators 

As just described, the strongest of the five assumptions necessary to identify an IV model 
is the exclusion restriction: a variable Pi (e.g., assignment to the treatment group) can 
be used as an instrument for the effect of Xi (e.g., participation in a program) on the outcome of 
interest Yi only if the relation between Pi and Yi is fully mediated by Xi. Meeting this assumption is 
difficult in practice because programs often aim to affect multiple aspects of behavior and are 
composed of a variety of requirements, services, and incentives to achieve multiple goals (e.g., to 
increase employment through work requirements and reduce poverty through earnings supple- 
ments).^ Consequently, the availability of one instrument is often insufficient to capture all of the 
induced behavior changes that constitute the program effect of Pi on Yi. Under these circumstances, 
using a randomly assigned program variable to estimate the effect of a single Xi on Yi may 
violate the exclusion restriction and will result in biased estimates of the effects of Xi on Yi. 

For example, consider a research project that uses a random assignment evaluation of a 
welfare-to-work program, P1, to analyze the effect of parental employment on child behavior. Is 
children’s behavior affected by their parents’ decision to seek employment? Figure 1 illustrates 
how such an analysis could be structured. 



Figure 1 

Using Program P1 to Analyze the Effect of Parental Employment 
on Child Well-Being 




In this figure, P1 represents an employment program that offers an earnings supplement to people 
who work full time. It is used as an instrument for parental employment in an effort to produce 
an unbiased estimate of the effect of parental employment on child well-being. Through its 
rules and services, program P1 is expected to increase parental employment, which in turn is 
expected to affect child behavior. However, for this analysis to be valid, the exclusion restriction 
requires that parental employment be the only pathway through which P1 affects child behavior. 
Yet programs like P1 often provide parents with additional services, such as child care 
subsidies and advice on how to find good-quality child care. This aspect of program P1 could 
constitute a separate pathway through which P1 might affect child behavior, as illustrated in 
Figure 2. While some of the increased child care may be a result of the increased employment 
among parents (and thus captured in the total effect of employment on children), some parents 
may alter their use of child care even without changing their employment behavior. That is, even 
if the program has no effect on employment, parents in the program may use more or different 
types of child care than other parents. In that case, attributing all of P1’s effect on child behavior to 
changes in parental employment is incorrect and will lead to biased estimates of the effect of parental 
employment on child behavior. 

^The simple ratio presented in Equation 2 will no longer produce a valid IV estimate. 



Figure 2 

Programs Can Affect Employment and Child Care 
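
The size of the bias created by the second pathway in Figure 2 can be sketched with a short derivation. It assumes linear, constant effects, and the symbols are introduced here for illustration only: X_{i1} is parental employment, X_{i2} is the part of child care use that changes independently of employment, and δ_1 and δ_2 are the effects of program P1 on each.

```latex
\begin{align*}
Y_i &= \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i ,\\
E[X_{i1}\mid P_1 = 1] - E[X_{i1}\mid P_1 = 0] &= \delta_1 ,\qquad
E[X_{i2}\mid P_1 = 1] - E[X_{i2}\mid P_1 = 0] = \delta_2 ,\\
\frac{E[Y_i\mid P_1 = 1] - E[Y_i\mid P_1 = 0]}{\delta_1}
  &= \frac{\beta_1\delta_1 + \beta_2\delta_2}{\delta_1}
   = \beta_1 + \beta_2\,\frac{\delta_2}{\delta_1}.
\end{align*}
```

Under these assumptions the single-instrument ratio equals the employment effect β_1 only if the program does not move child care (δ_2 = 0) or child care does not affect the outcome (β_2 = 0); otherwise the second term is attributed to employment.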




To address this problem, it is necessary to introduce a second instrument, P2, as illustrated 
in Figure 3. As will be discussed, this could be a second randomly assigned treatment 
(one emphasizing employment or child care use to a different extent than P1), or it could be a 
different program site or welfare office. 

Figure 3 



Using Two Program Group Variables to Separately 
Identify Parental Employment and Child Care 




Multiple Mediators and Identifying IV Estimates 

Identifying the empirical IV model depends on satisfaction of the exclusion restriction. 
A model with more than one endogenous variable is “unidentified” by a single instrument and 
requires multiple instruments to achieve identification. More specifically, a model is unidentified 
if there is not enough information to estimate all of its parameters. In effect, infinitely 
many sets of parameter values satisfy the conditions specified by the model. For the 
model to be identified, there must be at least one instrument for each endogenous variable that is 
being used as a predictor in the second-stage equation. If the number of exogenous instruments 
in the model is equal to the number of endogenous variables that need to be instrumented, the 
equation is “just-identified” or “exactly identified.” In this case, there is just enough information 
to estimate the parameters needed.^ If the number of instruments in the model exceeds the num- 
ber of endogenous variables, the equation is “overidentified.”^ In the next section, we describe a 
number of possible approaches to creating multiple instruments using random assignment ex- 
periments to estimate the impacts of more than one policy variable (or more than one mediator). 
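
The counting rule in this paragraph, together with the requirement that the instruments not move the endogenous variables in exact proportion to one another, can be sketched as a small check. The function names and the first-stage values used in the usage note are hypothetical. In real data the check applies to population first-stage effects; an estimated matrix is almost never exactly rank-deficient, so near-proportionality is the practical concern.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small matrix by Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue                      # no usable variation in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def identification_status(first_stage):
    """first_stage[j][m] = effect of instrument j on endogenous variable m."""
    n_instruments, n_endogenous = len(first_stage), len(first_stage[0])
    if n_instruments < n_endogenous or matrix_rank(first_stage) < n_endogenous:
        return "unidentified"
    return "just-identified" if n_instruments == n_endogenous else "overidentified"
```

For example, `identification_status([[0.5, 0.1]])` reports an unidentified model (one instrument, two endogenous variables), and so does `identification_status([[0.5, 0.1], [1.0, 0.2]])`, because the second instrument is a multiple of the first.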



^If one instrument is simply a linear combination of another, the equation would still be unidentified, even 
with the same number of instruments as endogenous variables. As in the case in which too few instruments are 
included, there are an infinite number of parameters that can satisfy the equations. 

^One advantage to overidentified models is that they allow a test of the validity of the instruments. More spe- 
cifically, a test of the overidentifying restrictions can be conducted to test whether there is any association between 
the instruments and the error term in the second-stage equation. Because the instruments are supposed to reflect 
random variation in Xi, they should not have any such association. A lack of association between the instruments 
and the error term provides some indication that the instruments are valid. There are several possible explanations 
for an association between the instruments and the error term. One is that the equation is misspecified. 

Creating Multiple Instruments: Multigroup Randomized 
Experiments 

One approach to constructing multiple instruments is to exploit a multiple group research 
design within a study. In a multiple group research design, subjects are randomly assigned 
to one of several program groups or to a control group. An example of using IV estimation 
with data from a study that employed a three-group research design to identify the effects 
of income on children’s well-being is described in Box 1. The multigroup research design provides 
access to several valid instruments, with separate program dummies representing assignment 
to the first program group, the second program group, and so forth. The original 
one-policy-variable model is thus expanded to: 

\[ Y_i = \alpha + \sum_{m} \beta_m X_{im} + \sum_{k} \gamma_k Z_{ik} + \varepsilon_i \]

where Xim are m potentially endogenous policy variables (child care use and employment in the 
example above) and βm are the estimated effects associated with those variables. Expressed as a 
set of IV models, this equation can be written as follows: 

\[ Y_i = \alpha + \sum_{m} \beta_m \hat{X}_{im} + \sum_{k} \gamma_k Z_{ik} + \varepsilon_i , \quad \text{where for every } X_{im}, \qquad (9) \]

\[ X_{im} = \pi_m + \sum_{s} \delta_{ms} P_{is} + \mu_{im} , \quad \text{in which} \qquad (10) \]

μim is the first-stage error term and other covariates Zik are omitted for simplicity. A necessary 
condition for this system of equations to be identified is that s, the number of independent 
instruments, be equal to or greater than m, the number of mediators X through which Pi affects Yi. 
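
A minimal simulation can show how two randomly assigned program groups identify the effects of two mediators. The sketch assumes a three-group design with no covariates, in which case the just-identified IV estimate can be obtained by solving the two equations that link the programs' impacts on the outcome to their impacts on the mediators. All names (`simulate`, `impacts`, `iv_two_mediators`) and all effect sizes are hypothetical.

```python
import random


def group_mean(values, groups, g):
    picked = [v for v, k in zip(values, groups) if k == g]
    return sum(picked) / len(picked)


def impacts(values, groups):
    """Program-control differences in means for program groups 1 and 2."""
    base = group_mean(values, groups, 0)
    return [group_mean(values, groups, 1) - base, group_mean(values, groups, 2) - base]


def iv_two_mediators(groups, x1, x2, y):
    """Solve r_j = b1*d_j1 + b2*d_j2 (j = 1, 2) for (b1, b2) by Cramer's rule."""
    d1, d2, r = impacts(x1, groups), impacts(x2, groups), impacts(y, groups)
    det = d1[0] * d2[1] - d2[0] * d1[1]
    b1 = (r[0] * d2[1] - d2[0] * r[1]) / det
    b2 = (d1[0] * r[1] - r[0] * d1[1]) / det
    return b1, b2


def simulate(n=60000, b1=0.5, b2=-1.0, seed=1):
    """Hypothetical three-group experiment with an unobserved confounder u."""
    rng = random.Random(seed)
    groups, x1, x2, y = [], [], [], []
    for _ in range(n):
        g = rng.randrange(3)              # 0 = control, 1 = program P1, 2 = program P2
        u = rng.gauss(0, 1)               # unobserved family trait
        # The two programs move the two mediators to different extents.
        m1 = (1.0 if g == 1 else 0.3 if g == 2 else 0.0) + u + rng.gauss(0, 1)
        m2 = (0.2 if g == 1 else 1.0 if g == 2 else 0.0) + 0.5 * u + rng.gauss(0, 1)
        groups.append(g)
        x1.append(m1)
        x2.append(m2)
        y.append(b1 * m1 + b2 * m2 + 2.0 * u + rng.gauss(0, 1))
    return groups, x1, x2, y


if __name__ == "__main__":
    est1, est2 = iv_two_mediators(*simulate())
    print("estimated effects of mediators 1 and 2:", round(est1, 2), round(est2, 2))
```

Because group membership is randomly assigned, the confounder u averages out of the program-control differences, so the solved values should land near the effects built into the simulation (0.5 and -1.0).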

Of course, even after multiple instruments are added, the exclusion restriction still ap- 
plies: The only way that the instruments are assumed to affect the outcome is through the path- 
ways that were explicitly included in the model. This assumes that any other important path- 
ways are not omitted, either because they were not measured or because there were not enough 
instruments to estimate their effects. Thus, implicitly, the contribution of any of these other 
pathways to changes in outcome Yi (e.g., in the example above, any family-level income effects 
associated with the increase in parental employment) is absorbed by the effects associated with 
the included pathways. 

Arguably, multigroup random assignment designs are the cleanest way to produce mul- 
tiple instruments with which to estimate complicated multimediator models like those intro- 
duced above, because they provide more than one truly exogenous instrument. However, it is 
difficult and expensive to carry out such designs in a real-world program environment. Also, for 
such designs to work in the context of an instrumental variables analysis, there must be sufficient 
variation in the effects of the individual program variables on the mediating policy variables 
(i.e., in Figure 3, the effects of P1 and P2 on the two mediators must differ from one another). 
Otherwise, we have, in effect, one rather than two independent instruments. For example, 
Box 1 shows that the Minnesota Family Investment Program (MFIP) had a larger effect on 
employment than did an incentives-only variant of the program. However, the effects of MFIP 
and MFIP Incentives Only on income were relatively similar, which may be one reason why the 
IV estimates are not larger. All of this means that there are relatively few multigroup social 
policy experiments that have sufficient truly random variation in their treatments to carry out 
complex IV estimation procedures like those outlined above. Other approaches may be necessary 
and are discussed below. 
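
One way to see why similar first-stage effects weaken the analysis is to ask how much an error in the estimated impacts on the outcome is magnified when the two-equation system is solved for the mediators' effects. The sketch below is only illustrative and is not a substitute for proper standard errors. It takes the year-one first-stage coefficients from Box 1 as read here (0.20 and 0.09 for employment, 1.40 and 1.32 for income) and compares them with a hypothetical design in which the two programs differ more sharply. The function names are assumptions.

```python
def invert_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("instruments move the mediators in proportion; not identified")
    return [[d / det, -b / det], [-c / det, a / det]]


def error_amplification(first_stage):
    """For each mediator, the largest factor by which an error in the outcome impacts
    can move its estimated effect (absolute row sums of the inverse matrix).

    first_stage[j][m] = effect of program j on mediator m.
    """
    inverse = invert_2x2(first_stage)
    return [abs(row[0]) + abs(row[1]) for row in inverse]


if __name__ == "__main__":
    # Rows: Full MFIP, MFIP Incentives Only. Columns: employment, income.
    mfip_year_one = [[0.20, 1.40], [0.09, 1.32]]
    # Hypothetical design in which each program mostly moves one mediator.
    well_separated = [[0.20, 0.10], [0.02, 1.32]]
    print("MFIP year one: ", [round(v, 1) for v in error_amplification(mfip_year_one)])
    print("well separated:", [round(v, 1) for v in error_amplification(well_separated)])
```

With the Box 1 values, the factor for employment is much larger than in the well-separated case, which is one way of seeing that two instruments with similar effects carry little independent information.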



Box 1 

An Empirical Example of IV Analysis with Multiple Mediators 
Using Data from a Multigroup Research Design 

In this study, data from the Minnesota Family Investment Program (MFIP) were used to esti- 
mate the effects of income on child well-being. In MFIP, single-parent families receiving wel- 
fare were randomly assigned to one of three research groups: (1) Full MFIP, (2) MFIP Incen- 
tives Only, or (3) AFDC (the control group). Whereas under AFDC earnings reduced welfare 
payments dollar for dollar, families in both Full MFIP and MFIP Incentives Only were able to 
keep more of their welfare income as their earnings increased. In addition, families in the MFIP 
group were required to participate in employment and training services if they had been on 
welfare for 24 of the prior 36 months (or else face sanctions), while those in the Incentives 
Only group did not face any of these employment and training mandates. Families assigned to 
the AFDC group received the benefits of Minnesota’s AFDC program. 


The first-stage equation predicting income included the two instruments (P1 and P2), representing 
assignment to each of two research groups, and a set of baseline characteristics hypothesized to affect income 
and employment. A similar first-stage equation predicting employment was also estimated. Because 
MFIP’s effects on employment and income were stronger in the first year than later in the follow-up 
period, income and employment in the first year of the study were used as the dependent variables in the 
first-stage equations (Miller et al., 2000). 

The results of the first-stage equations are presented in Table 1.1 below. As is evident from the table, the 
dummy variables representing Full MFIP and MFIP Incentives Only were associated with employment and 
income, a necessary condition for the IV strategy. A test of the effects of the instruments suggests that these 
variables are strong predictors of both employment (F = 13.38, p < .001) and income in the first year (F = 
14.65, p < .001). 
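
The F test of the instruments compares a first-stage regression that includes the program-group dummies with one that excludes them. The sketch below shows the calculation in the simplest case, with no baseline covariates, where it reduces to comparing group means with the grand mean. The function name and the toy data are hypothetical, and the models in this box also include covariates, which this sketch omits.

```python
def first_stage_f(values, groups):
    """F statistic for the hypothesis that all program-group dummies have zero
    coefficients in a regression of `values` on an intercept and the dummies."""
    labels = sorted(set(groups))
    n, k = len(values), len(labels)
    grand_mean = sum(values) / n
    # Restricted model: intercept only.
    ssr_restricted = sum((v - grand_mean) ** 2 for v in values)
    # Unrestricted model: a separate mean for each research group.
    ssr_unrestricted = 0.0
    for g in labels:
        members = [v for v, lab in zip(values, groups) if lab == g]
        group_mean = sum(members) / len(members)
        ssr_unrestricted += sum((v - group_mean) ** 2 for v in members)
    q = k - 1                              # number of excluded instruments
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k))


if __name__ == "__main__":
    income = [1, 2, 3, 4, 5, 6, 7, 8, 9]                      # toy values
    group = ["control"] * 3 + ["full"] * 3 + ["incentives"] * 3
    print("F =", first_stage_f(income, group))
```

A large F statistic indicates that assignment shifts the mediator enough to serve as an instrument; a small one signals weak instruments.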

Table 1.1 

The Effects of MFIP on Employment and Income: 
First-Stage Regression Results for IV Model 

                          Year One      Year One      Three Year    Three Year
                          Employment    Income        Employment    Income

Full MFIP                 0.20***       1.40***       0.16***       1.23***
                          (0.04)        (0.29)        (0.03)        (0.36)

MFIP Incentives Only      0.09**        1.32***       0.08          1.10***
                          (0.04)        (0.29)        (0.03)        (0.37)

F value                   13.38***      14.65***                    6.89***

Sample size = 879 

SOURCE: MDRC calculations using MFIP administrative and baseline survey data. 

NOTES: Standard errors in parentheses. 

The sample includes long-term welfare recipients randomly assigned from April 1, 1994, to October 31, 1994, 
excluding the small percentage who were receiving only Food Stamps at random assignment. 

Income is measured in thousands of dollars, in the first year after random assignment and on average over the three-year 
follow-up. Employment is measured as ever employed, in the first year after random assignment and on average over the 
three-year follow-up. 

The regressions also include the following covariates measured at baseline: black, other racial/ethnic minority, mother 
was a teen at child’s birth, number of children in the family, presence of a child age 6 or less, mother had no high school 
degree or equivalent, mother never married, mother on welfare 5 or more years, earnings in the prior year, and indicators for the 
quarter of random assignment. 

Two-tailed significance levels are indicated as: * = 10 percent; ** = 5 percent; *** = 1 percent. 


The second-stage equation used predicted income and employment (e.g., X̂i1 and X̂i2) along with 
the same set of baseline characteristics (excluding the instruments) to predict the child outcomes. 

The results of this second-stage equation are presented in Table 1.2. For comparison, the results of 
analogous OLS estimation methods are also provided. (In the OLS models, the same covariates are 
used as in the second-stage IV estimate, and income and employment are both included in the equa- 
tion, rather than the predicted value of income and employment as in the IV models.) In the OLS 
models, small, insignificant effects of income are found. However, in the IV models, significant 
positive effects of income are found, predicting engagement in school and positive social behavior. 
The effects on the other two variables are in the expected direction (favorable) but are not 
statistically significant. In none of the models did employment have a significant effect (although, 
interestingly, in the IV models the coefficients on the employment measures are always opposite 
in sign to those of the income measures). 

Table 1.2 

OLS and IV Estimates of the Effects of Employment and Income 
on Children’s School and Behavioral Outcomes 

                                   OLS        IV Model 1            IV Model 2

Effects of Income

School Achievement                -0.02       0.16                  0.20
(mean = 4.06, sd = 1.10)          (0.01)      (0.14)                (0.20)
                                              H: p = .02            H: p = [illegible]

School Engagement                 -0.01       0.47*                 0.59
(mean = 10.10, sd = 1.82)         (0.02)      (0.27)                (0.42)
                                              H: p = [illegible]    H: p = [illegible]

Behavior Problems                 -0.16       -1.57                 -1.94
(mean = 11.69, sd = 9.20)         (0.12)      (1.23)                (1.72)
                                              H: p = [illegible]    H: p = [illegible]

Positive Behavior                 0.17        11.04*                14.01
(mean = 196.16, sd = 37.5)        (0.49)      (6.18)                (9.32)
                                              H: p = [illegible]    H: p = [illegible]

Effects of Employment

School Achievement                -0.02       -0.17                 -0.36
(mean = 4.06, sd = 1.10)          (0.09)      (1.08)                (1.66)
                                              H: p = [illegible]    H: p = [illegible]

School Engagement                 0.13        -1.05                 -1.87
(mean = 10.10, sd = 1.82)         (0.16)      (1.86)                (3.17)
                                              H: p = .01            H: p = [illegible]

Behavior Problems                 0.45        1.99                  4.32
(mean = 11.69, sd = 9.20)         (0.80)      (8.30)                (12.87)
                                              H: p = [illegible]    H: p = [illegible]

Positive Behavior                 -3.19       -65.05                -94.62
(mean = 196.16, sd = 37.5)        (3.24)      (41.63)               (69.74)
                                              H: p = [illegible]    H: p = [illegible]

Sample size = 879 

SOURCE: MDRC calculations using MFIP administrative, child survey, and baseline survey data. 

NOTES: Standard errors in parentheses. H indicates Hausman test (p values of F test are indicated). 

The sample includes long-term welfare recipients randomly assigned from April 1, 1994, to October 31, 
1994, excluding the small percentage who were receiving only Food Stamps at random assignment. 

Income is measured in thousands of dollars, in the first year after random assignment (model 1) and on 
average over the three-year follow-up (model 2). Employment is measured as ever employed, in the first year after 
random assignment (model 1) and on average over the three-year follow-up (model 2). 

The regressions also include the following covariates measured at baseline: black, other racial/ethnic 
minority, mother was a teen at child's birth, number of children in the family, presence of a child age 6 or less, 
mother had no high school degree or equivalent, mother never married, mother on welfare 5 or more years, 
earnings in the prior year, and indicators for the quarter of random assignment. 

Two-tailed significance levels are indicated as: * = 10 percent; ** = 5 percent; *** = 1 percent. 



Creating Multiple Instruments: Multisite Randomized 
Experiments 

An alternative approach that can be implemented post hoc is to exploit the variability 
that occurs due to the implementation of comparable experiments across multiple sites or of- 
fices. It is possible to create more than one instrument by interacting the random assignment 
treatment variable with a variable representing each of the sites. The result is an analysis in 
which variation in the implementation of the program is used to identify the various pathways 
through which the program affected the outcome. This approach works best when there is a 
fairly large number of sites or offices and program implementation was varied either deliberately 
to produce variation in program variables Pi or naturally for reasons exogenous to the policy 
variables Xi and the outcome variables Yi. Such exogeneity of the variation in Pi across sites 
is essential to safeguard the validity of the instrumental variables analysis. 
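
Constructing the site-based instruments is mechanical: each instrument equals 1 for program-group members in one site and 0 for everyone else. The sketch below shows one way to build them; the function name and the site labels in the usage note are hypothetical. Site indicators themselves would normally enter both stages as covariates, so that only within-site program-control contrasts identify the model.

```python
def site_interaction_instruments(sites, treated):
    """Return one 0/1 instrument per site: 1 for program-group members in that site."""
    names = sorted(set(sites))
    return {
        name: [1 if (s == name and t == 1) else 0 for s, t in zip(sites, treated)]
        for name in names
    }


if __name__ == "__main__":
    sites = ["A", "A", "B", "B", "C", "C"]       # hypothetical site labels
    treated = [1, 0, 1, 0, 1, 0]                 # random assignment within each site
    for name, column in site_interaction_instruments(sites, treated).items():
        print(name, column)
```

With three sites this yields three instruments, enough by the counting rule to estimate the effects of up to three mediators, provided program effects on the mediators genuinely differ across sites.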

Bos and Granger (2000) provide an example using a multisite approach to estimate the 
effect of early day care use on the school readiness of children born to teen mothers. Using data 
from the 16-site New Chance Demonstration (Quint et al., 1997), this paper exploits variation 
across the sites in program effects on day care use and other possible mediators of program ef- 
fects on child outcomes to disentangle effects of different aspects of children’s day care experi- 
ences on the child outcomes. Another example using data from a multisite and multigroup re- 
search design to estimate the effects of maternal education on children’s cognitive outcomes is 
described in detail in Box 2. Note that the data from this study were earlier presented as violating 
the monotonicity assumption when used to estimate the effects of maternal employment on chil- 
dren’s outcomes because welfare recipients randomly assigned to the HCD programs may have 

Box 2 



An Empirical Example of IV Analyses with Multiple Mediators 
Using Data from a Multisite and Multigroup Research Design 

The study described here examined the effects of parents’ participation in schooling using 
data from the National Evaluation of Welfare-to-Work Strategies (NEWWS; Magnusson 
and McGroder, 2002). The evaluation of NEWWS included six programs evaluated across 
three sites (Atlanta, Georgia; Grand Rapids, Michigan; and Riverside, California) that 
operated in the early to mid 1990s under the federal Job Opportunities and Basic Skills 
Training (JOBS) Program, which preceded the current welfare system, Temporary Assistance 
for Needy Families (TANF). The primary objective of these programs, like TANF, was to 
reduce single parents’ welfare use and increase their employment. In one condition, single- 
parent welfare recipients were assigned to a program that required most participants to 
look for work immediately, usually by attending a “job club” that lasted one to three weeks 
(this condition was termed labor force attachment, or LFA). In the other condition, partici- 
pants were placed in education and training programs (usually adult basic education or vo- 
cational training) to increase their knowledge and skills before they attempted to move into 
employment (this condition was termed human capital development, or HCD). 

Each of the three sites operated both an LFA program stressing job search as a first activity and 
an HCD program stressing basic education as a first activity. Single-parent welfare recipients 
were randomly assigned to one of these program groups or to a control group. The program 
groups were required to participate in basic education or employment-related activities (de- 
pending on the group) as a condition of receiving welfare. Families who failed to meet the par- 
ticipation requirements could receive sanctions, that is, have their welfare grants reduced. 

IV models were estimated to assess what effect parents’ participation in schooling had on 
children’s cognitive test scores. Interactions between program and site were used as instruments to 
estimate the effects of two endogenous variables, participation in employment and participation 
in educational activities, on test scores assessing children’s school readiness two years after 
parents’ random assignment to the programs. 






The first-stage IV results are presented in Table 2.1 below. They indicate that the HCD programs 
in the three sites all significantly increased parents’ participation in educational activities. The 
LFA programs in the three sites all significantly increased parents’ participation in employment. 
And three of the programs (Atlanta LFA, Grand Rapids LFA, and Riverside HCD) significantly 
affected both education and employment. Note that the effects across the six programs 
are different with respect to their effects on education and employment. This variation is critical 
to identifying the second-stage effects in the IV model using predicted values of education and 
employment as independent variables in models predicting child test scores. 



Table 2.1 

First-Stage IV Coefficients, F-statistics, and R-squares 
(Standard Errors in Parentheses) 

Instruments                      Months of Education     Quarters of Employment

Atlanta HCD                      2.36 ***                .25
                                 (.34)                   (.17)

Atlanta LFA                      .60 *                   .43
                                 (.34)                   (.17)

Grand Rapids HCD                 .96 *                   .00
                                 (.50)                   (.25)

Grand Rapids LFA                 -.98 *                  .96 ***
                                 (.50)                   (.25)

Riverside HCD                    2.94                    .68 ***
                                 (.43)                   (.21)

Riverside LFA                    -.36                    1.22 ***
                                 (.44)                   (.22)

F-statistic for instruments      20.90                   9.63

Full model R-square              .17                     .21

Increase in R-square
associated with instruments      .040                    .015

NOTES: Two-tailed significance levels are indicated as: * = 10 percent; ** = 5 percent; 
*** = 1 percent. 

Covariates were included for: educational attainment and participation at baseline, 
prior earnings, prior welfare receipt, numeracy, literacy, depressive symptoms, 
mother’s and focal child’s age, number of baseline risk factors, family barriers to 
employment, race, marital status, number of children, an index of one’s sense of control 
over one’s life, sources of social support, and child gender. 


The results of the second-stage model using predicted employment and education from the 
first-stage equation as a predictor of children’s test scores are presented in Table 2.2. Two 
models were estimated in the second-stage equation. In the first model, only the predicted 
value of education was used as an independent variable. In the second model, both the pre- 
dicted value of education and the predicted value of employment were used as independent 
variables. In both models, participation in education had a positive, significant effect on 
children’s test scores. The effects of employment were significant in comparative OLS models, 
but not in the IV model. 



Table 2.2 

OLS and IV Estimates of Months in Educational Activities on 
Children’s Raw Bracken School Readiness Composite Scores 
(Standard Errors in Parentheses) 

                            Model 1: Bracken            Model 2: Bracken
Independent variables       OLS          IV             OLS          IV

Months in education         .089 ***     .305 *         .098 ***     .311 *
                            (.035)       (.168)         (.035)       (.169)

Quarters of employment                                  .134 *       .671
                                                        (.070)       (.493)

NOTES: Two-tailed significance levels are indicated as: * = 10 percent; ** = 5 percent; *** = 1 
percent. 



Covariates were included for: educational attainment and participation at baseline, prior 
earnings, prior welfare receipt, numeracy, literacy, depressive symptoms, mother’s and focal 
child’s age, number of baseline risk factors, family barriers to employment, race, marital status, 
number of children, locus of control, sources of social support, and child gender. 



initially reduced their employment in order to pursue more education. The monotonicity assumption 
is not violated, however, when these data are used to estimate the effects of maternal education 
on children’s outcomes. One way to assess bias due to the violation of monotonicity in this 
case is to compare the IV estimates of the effects of maternal education using data from the HCD 
programs with the IV estimates on the effects of maternal education from the LFA programs. 

Creating Multiple Instruments: Subgroups from Randomized 
Experiments 

A similar approach can be implemented in which variation in the responses of particular 
subpopulations to a program Pi is used to create multiple instrumental variables. In this case, the 
random assignment treatment variable is interacted with one or more exogenous baseline 
characteristics, such as age or gender. Thus, the baseline characteristic serves as a covariate in both 
equations, and the interaction of the program variable and the baseline characteristic serves as 
one of the instruments. In equation form, this can be written as: 

\[ Y_i = \alpha + \sum_{m} \beta_m \hat{X}_{im} + \sum_{k} \gamma_k Z_{ik} + \varepsilon_i , \quad \text{where for every } X_{im}, \qquad (11) \]

\[ X_{im} = \pi_m + \sum_{s} \delta_{ms} (P_i \times Z_{is}) + \sum_{k} \theta_{mk} Z_{ik} + \mu_{im} , \qquad (12) \]

in which Zis is a series of exogenous baseline variables, s ≤ k, and s ≥ m. 

Conceptually and technically, this approach is identical to the use of different sites as 
variables Zis. In practice, interacting the program variable with demographic characteristics or 
other baseline variables as well as by site may be problematic. Variation in program effects 
across different subgroups in the same location or across locations may not be truly exogenous 
to a measure of child well-being. Selection (that is, something unique about the subgroups or 
sites that drives program effects) could account for part of this variation, which would 
undermine the IV estimates’ validity. It is necessary to ensure that the relationships between the 
endogenous variables Xi and the outcome Yi are also not significantly moderated by the 
exogenous baseline characteristics Zis or by site. In other words, the effect of Xi on Yi must be the same 
across different levels of Zis (for example, if instruments were created by interacting the 
program group variable with child age or child gender, the relationship between child care use and 
children’s school readiness must be the same across child age or child gender). 
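
A small numerical sketch shows what goes wrong when this condition fails. It assumes two subgroups, two mediators, and program impacts that are known without sampling error; all numbers and function names are hypothetical. In the example the effect of the first mediator differs between subgroups while the effect of the second is the same in both, yet solving for a single common pair of effects recovers neither.

```python
def outcome_impact(first_stage_row, effects):
    """Program impact on the outcome implied by impacts on the mediators."""
    return sum(d * b for d, b in zip(first_stage_row, effects))


def solve_common_effects(first_stage, outcome_impacts):
    """Solve a*b1 + b*b2 = r1 and c*b1 + d*b2 = r2 for (b1, b2)."""
    (a, b), (c, d) = first_stage
    r1, r2 = outcome_impacts
    det = a * d - b * c
    return (r1 * d - b * r2) / det, (a * r2 - c * r1) / det


if __name__ == "__main__":
    # Rows: subgroup A, subgroup B. Columns: program impact on mediators 1 and 2.
    first_stage = [[1.0, 0.2], [0.3, 1.0]]
    true_effects_a = (0.5, -1.0)      # effects of mediators 1 and 2 in subgroup A
    true_effects_b = (1.5, -1.0)      # mediator 1 matters more in subgroup B
    impacts_on_y = [outcome_impact(first_stage[0], true_effects_a),
                    outcome_impact(first_stage[1], true_effects_b)]
    print(solve_common_effects(first_stage, impacts_on_y))
```

Here the solved effect of the first mediator falls below its true value in both subgroups, and the solved effect of the second mediator is also wrong even though its true effect does not vary.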

Creating Multiple Instruments: Pooling Data from Multiple 
Experiments 

A final alternative for constructing multiple instruments is to construct a pooled data set, 
that is, a data set that combines information from multiple random assignment experiments. Pooling is 
only possible if the type and quality of the data as well as the outcomes of interest are comparable 
across the pooled studies. The pooled data, as described in Box 3, offer the benefits of having an 
instrument for each respective random assignment study. These multiple instruments can be used 
for either of two purposes: (1) to estimate the effects of multiple mediators or (2) to increase the precision of 
estimates for a single mediator, a topic that we turn to in the next section. Getting comparable 
information from enough random assignment experiments is no small feat, and it can often take 
years before enough studies have been conducted for such an effort to be productive. 



Box 3 

An Empirical Example Using Pooled Data from Multiple 
Random Assignment Experiments to Estimate the Effects of Income, 
Employment, and Child Care on Children's Well-Being 

In this example, we attempt to answer questions about the effects of income, employment, and 
child care on children’s well-being using a pooled data set of experimental studies of welfare 
and work programs. Each study evaluated a program using a random assignment design, and 
comparable data on families and children were collected across studies. 

The primary equation of interest is:

\[ Y_i = \alpha + \beta_1 F_i + \beta_2 E_i + \beta_3 I_i + Z_i \gamma + \varepsilon_i \]

where i represents each child, F is a measure of participation in formal child care, E is a
measure of parents’ employment, and I is a measure of family income. The outcome variable Y is a
measure of children’s cognitive functioning. The Z’s are a variety of controls or covariates
predicted to affect children’s cognitive functioning, such as age and education of the mother and
number of siblings, and the error term is represented by $\varepsilon_i$. Using OLS techniques,
the estimated effects of F, I, or E on Y could be biased if these variables are correlated with the
error term. Instrumental variables models can control for such biases.

In this case, the first stage in estimating such a model would require estimating three models 
that look something like: 

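\[ E_i = \alpha_E + Z_i \gamma_E + P_i \pi_E + \theta_i \]
\[ I_i = \alpha_I + Z_i \gamma_I + P_i \pi_I + \mu_i \]
\[ F_i = \alpha_F + Z_i \gamma_F + P_i \pi_F + \delta_i \]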

These three first-stage equations yield $\hat{E}$, $\hat{I}$, and $\hat{F}$. These predicted
measures of employment, income, and formal child care replace the actual measures in the first
equation displayed in this box. The resulting estimates of the effects of formal child care,
income, and employment on children’s cognitive functioning will be free of bias if the IV
assumptions noted above hold.

The Ps in these equations represent the instruments used to identify the first-stage equations.
A minimum of three instruments, one for each first-stage model, is needed to estimate this
model. The pooled data from multiple welfare and work programs offer a number of
possibilities. The challenge is to construct a set or sets of instruments that will (1) reliably
predict the first-stage outcome of interest (employment, income, or formal care) and (2) reliably
distinguish the prediction of one outcome from the prediction of another outcome (e.g., do a
better job at predicting employment than predicting formal care).

Three instruments that represent differing policy approaches can be used to identify the 
above equations. Specifically, some of the studies evaluated programs with expanded 
child care resources that affected parents’ use of formal as opposed to informal care. 
Some of the studies evaluated programs that increased employment through mandatory
employment services but did not increase participants’ income (welfare recipients in these
programs traded their welfare checks for earnings). Finally, some of the studies evaluated
programs that provided financial incentives to work; these increased both employment and
income. The logic
here is that particular policy approaches in these welfare and work programs had unique 
influences on certain outcomes. In other words, the programs did not all affect formal 
care, employment, and income in the same way. 
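
The two-stage procedure described in this box can be sketched in a few lines of Python with NumPy. This is a minimal illustration on simulated data, not the analysis of the pooled studies: the function names, the sample size, the three hypothetical program-type indicators, and every coefficient in the data-generating step are assumptions chosen only to show the mechanics.

```python
import numpy as np

def two_stage_least_squares(y, X_endog, Z_exog, P_instr):
    """2SLS: predict each mediator from covariates and instruments, then
    regress the outcome on the predicted mediators and the covariates."""
    n = len(y)
    const = np.ones((n, 1))
    # First stage: one regression per mediator on [constant, Z, P].
    W = np.hstack([const, Z_exog, P_instr])
    coef_first, *_ = np.linalg.lstsq(W, X_endog, rcond=None)
    X_hat = W @ coef_first
    # Second stage: predicted mediators replace the actual ones.
    D = np.hstack([const, X_hat, Z_exog])
    coef_second, *_ = np.linalg.lstsq(D, y, rcond=None)
    k = X_endog.shape[1]
    return coef_second[1:1 + k]  # coefficients on the mediators only

def simulate(n=30000, seed=0):
    """Simulated data; all numeric values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    # 0 = control, 1 = child care program, 2 = mandatory services, 3 = incentives
    group = rng.integers(0, 4, size=n)
    P = np.column_stack([(group == g).astype(float) for g in (1, 2, 3)])
    Z = rng.normal(size=(n, 2))   # baseline covariates
    U = rng.normal(size=n)        # unmeasured confounder
    # Each program type moves the three mediators differently.
    F = 1.0 * P[:, 0] + 0.1 * P[:, 1] + 0.1 * P[:, 2] + 0.5 * U + rng.normal(size=n)
    E = 0.1 * P[:, 0] + 1.0 * P[:, 1] + 1.0 * P[:, 2] + 0.5 * U + rng.normal(size=n)
    I = 1.0 * P[:, 2] + 0.5 * U + rng.normal(size=n)
    X = np.column_stack([F, E, I])
    true_effects = np.array([0.3, 0.2, 0.5])
    y = X @ true_effects + Z @ np.array([0.4, -0.2]) + U + rng.normal(size=n)
    return y, X, Z, P, true_effects

if __name__ == "__main__":
    y, X, Z, P, true_effects = simulate()
    print("2SLS estimates:", two_stage_least_squares(y, X, Z, P).round(2))
    print("True effects:  ", true_effects)
```

The sketch recovers the three effects only because the simulated program types shift formal care, employment, and income in distinct patterns; if all three indicators moved the mediators in the same proportions, the second stage would not be identified. Standard errors from a naive second-stage regression are not valid, so dedicated IV software should be used for inference.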



All of these approaches (multiple treatment groups within a study, interactions with 
site, interactions with subgroup variables, and pooling data across studies) can be combined to 
create a sufficiently large number of instruments to conduct an instrumental variables analysis 
with multiple mediators. This can be useful in the case of additional mediating variables, as 
well as for the purposes of verifying the validity of the instruments (by creating overidentified 
models; see footnote 8). 

Estimation Issues: The Problem of “Weak” Instruments 

The prior sections provided a general framework for understanding the policy questions 
that could be answered by instrumental variables analysis and the necessary assumptions for 
identifying causal effects to answer such questions, including the use of multiple instruments to 
identify multiple mediators. Even when all of the assumptions described above are met and an 
appropriate IV estimator is identified, actual estimation of instrumental variables poses a set of 
new issues concerning both the validity and reliability of the estimates. As discussed earlier, to 
obtain consistent and reliable instrumental variables estimates, a good instrument Pi must be 
highly correlated with the policy variable Xi. In the following section, we will discuss some of
the drawbacks of using “weak” instruments (i.e., instruments that do not have strong
correlations with the policy variable).

There are several risks in interpreting IV estimates obtained using a weak instrument or
a set of weak instruments. First, there is the risk of having large standard errors on the IV
estimates in the second stage of the procedure, which would make the estimates unreliable.
Second, weak instruments can produce IV estimates that are vulnerable to bias due to chance
correlations between the error terms in the different stages of the IV procedure. We discuss both
of these risks in more detail below.



The Cause of Weak Instruments: Weak Program Effects 

While the use of a randomly assigned program variable as an instrument avoids the 
problem of correlation with omitted variable(s), the possibility that a randomly assigned pro- 
gram variable is a weak instrument is real. Even if a program has its intended effect on a policy 
variable (e.g., increases vocational training or employment), assignment to the program group 
may not be the most important predictor of the policy variable relative to other potential predic- 
tors. For example, the effect of a program that seeks to increase employment may be small rela- 
tive to other predictors of current employment, such as prior employment experience, educa- 
tional background, or current family circumstances. This is particularly true when random as- 
signment to treatment is used to predict variation in policy variables Xi that are not primary tar- 
gets of the experiment. For example, when training or employment programs Pi increase the use 
of child care for parents with young children, random assignment to such programs can be used 
as an instrument to predict child care use (as long as a separate instrument is used to predict em- 
ployment, as described in the section on multiple mediators), but the strength of the relationship 
and the relevance of the instrument may be limited because directly encouraging child care use 
was not the original intent of the program. Consequently, the program variable Pi is likely to be 
a better predictor of variation in employment than of variation in child care use. 

This problem is compounded when predicting multiple mediators. In this case, obtain- 
ing reliable estimates will depend on strong program effects on each of the different outcomes 
as well as variation in program effects on different outcomes across the instruments. For exam- 
ple, if a pooled data set is composed of data from a set of random assignment studies of em- 
ployment programs and all of the studies increased employment and did not have any effects on 
income or child care, then the program variables (even though there is more than one) cannot be 
used as multiple instruments to predict income or child care. 

One Consequence of Weak Instruments: Unreliable IV
Estimates

To better understand the implications of having an irrelevant or weak instrument, recall
Equation 2. If Pi is not relevant to Xi or if Pi is a very weak predictor of Xi, then dXi/dPi (the
independent share of the variation in Xi that is due to Pi) will be negligible relative to the total
variation in Xi and not sufficient to purge the relationship between policy variable Xi and
outcome Yi of spurious covariation. This leads to invalid, inconsistent, or unreliable IV estimates,
as evidenced by large standard errors on the IV estimates in the second stage of the procedure.
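
To make the link between a weak first stage and large standard errors concrete, consider the textbook case of a single mediator, a single instrument, and an error term with constant variance. These simplifications, and the symbols n, \sigma^2, \sigma_X^2, and \rho_{XP}, are assumptions made for illustration rather than notation used elsewhere in this paper. Under those assumptions, the IV estimator and its approximate large-sample variance are:

```latex
\[
\hat{\beta}_{IV} = \frac{\widehat{\mathrm{Cov}}(P_i, Y_i)}{\widehat{\mathrm{Cov}}(P_i, X_i)},
\qquad
\mathrm{Var}\bigl(\hat{\beta}_{IV}\bigr) \approx \frac{\sigma^2}{n\,\sigma_X^2\,\rho_{XP}^2}
\]
```

Here n is the sample size, \sigma^2 is the variance of the error term, \sigma_X^2 is the variance of Xi, and \rho_{XP} is the correlation between the instrument and the mediator. As \rho_{XP} shrinks toward zero, the variance grows without bound. For instance, a correlation of 0.1 multiplies the variance by 100 relative to \sigma^2 / (n \sigma_X^2), so standard errors are roughly ten times as large.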

An example of the implications of weak program effects is described in Box 2. In this 
case, even though the program was designed to increase employment and income (the two en- 
dogenous variables that are being estimated in a first stage), program effects on employment 
and income were stronger during the first year relative to the average effects during the three- 
year follow-up period. As a result, the IV estimates of the effects of income on child well-being 
are more precisely estimated using the measure of income during the first year. The IV esti- 
mates of income averaged over the three years were similar in magnitude but were less precise 
(i.e., the estimate on income had a much larger standard error). This can be seen by comparing 
the results of model 1 with those of model 2 in Table 2.2. 

A Second Consequence of Weak Instruments: Biased IV 
Estimates 

In finite samples, even good instruments cannot ensure that estimates are unbiased.^
Consider a finite sample with an instrument Pi, an outcome variable Yi, and a mediator Xi. There
is no true effect of Xi on Yi, but both Xi and Yi are correlated with an unmeasured variable Zi. The
instrument Pi is a randomly created program variable and is uncorrelated with Zi by
construction. However, in a finite sample, Pi may be correlated with Zi by chance. Through this
correlation, and in this finite sample, the instrument Pi will reintroduce a spurious effect of Xi on
Yi (referred to as “finite sample bias”). Intuitively, finite sample bias arises because IV estimates
rely on the estimate of the first-stage coefficient, and hence on its precision, rather than on the
coefficient’s actual value. Even if there were no relationship between Pi and Xi, the estimates of
the relationship between Pi and Xi would not be zero in any finite sample. A number of
researchers have examined the finite sample properties of IV (e.g., Sawa, 1996; Staiger and
Stock, 1994).^

^An unbiased estimate means that the estimate has a sampling distribution centered on the
parameter of interest in a sample of any size. Because IV estimates are based on a ratio of random
quantities, the expectation of such a ratio does not necessarily have a simple form. A consistent
estimate means that the estimate converges to the population parameter as the sample size grows.

Using large samples is one way to minimize this bias as well as to have more precise
estimates (i.e., estimates with smaller standard errors). However, when the relationship between
instrument Pi and policy variable Xi is weak, such finite sample bias can be significant even
when samples are very large. Bound et al. (1995) demonstrate that the typical method for
minimizing finite sample bias, increasing the sample size, does not solve this problem, especially
for estimates obtained from large cross-sectional samples. In other words, large data sets do not
necessarily insulate IV estimates from finite sample bias caused by weak instruments.
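
A small Monte Carlo experiment can illustrate the bias described above. The sketch below assumes ten binary instruments, a true effect of Xi on Yi equal to zero, and an unmeasured confounder that raises both Xi and Yi; the function names and all numeric settings are hypothetical choices for illustration, not values taken from the studies cited here.

```python
import numpy as np

def tsls_slope(y, x, P):
    """2SLS slope of y on x using the columns of P as instruments."""
    P = P - P.mean(axis=0)
    x = x - x.mean()
    y = y - y.mean()
    coef, *_ = np.linalg.lstsq(P, x, rcond=None)
    x_hat = P @ coef                      # first-stage prediction
    return (x_hat @ y) / (x_hat @ x)

def ols_slope(y, x):
    """Ordinary least squares slope of y on x."""
    x = x - x.mean()
    return (x @ (y - y.mean())) / (x @ x)

def median_estimates(first_stage, n=500, k=10, reps=400, seed=0):
    """Median 2SLS and OLS estimates over repeated samples.
    `first_stage` is the assumed effect of each instrument on x."""
    rng = np.random.default_rng(seed)
    iv, ols = [], []
    for _ in range(reps):
        P = rng.integers(0, 2, size=(n, k)).astype(float)
        u = rng.normal(size=n)                      # unmeasured confounder
        x = first_stage * P.sum(axis=1) + u + rng.normal(size=n)
        y = 0.0 * x + u + rng.normal(size=n)        # true effect of x is zero
        iv.append(tsls_slope(y, x, P))
        ols.append(ols_slope(y, x))
    return float(np.median(iv)), float(np.median(ols))

if __name__ == "__main__":
    for strength in (0.02, 1.0):
        iv, ols = median_estimates(strength)
        print(f"first-stage effect {strength}: median 2SLS {iv:.2f}, median OLS {ols:.2f}")
```

With the weak setting, the median 2SLS estimate sits close to the confounded OLS estimate even though the true effect is zero; with the strong setting it sits close to zero. The `n` argument can be raised to explore how slowly the weak-instrument bias fades as the sample grows.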

Addressing the Problem of Weak Instruments 

Given the risks of using a weak instrument, one useful guideline in pursuing IV
estimation is to perform a close examination of the characteristics of the first-stage equation or
equations. The stronger the relationship between Pi and Xi (i.e., the greater the partial R² on the
excluded instruments), the lower the likelihood that weak instruments will bias the IV estimates.
Testing estimates with alternative instruments is one potential approach to assessing the
robustness of IV estimates or creating bounds for these estimates similar to confidence intervals on
traditional OLS estimates. However, when randomly assigned program variables are used as
instruments, there usually is no ready supply of such alternative instruments.
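
The first-stage check described above can be sketched as follows. The function compares a first-stage regression that omits the instruments with one that includes them and returns the partial R-squared and the F-statistic on the excluded instruments. The function and argument names are hypothetical, and the calculation assumes a linear first stage with homoskedastic errors.

```python
import numpy as np

def first_stage_diagnostics(x, Z_exog, P_instr):
    """Return (partial R-squared, F-statistic) for the excluded instruments
    in a first-stage regression of x on covariates Z and instruments P."""
    n = len(x)
    const = np.ones((n, 1))
    restricted = np.hstack([const, Z_exog])        # covariates only
    full = np.hstack([const, Z_exog, P_instr])     # covariates plus instruments

    def rss(design):
        # Residual sum of squares from regressing x on the design matrix.
        coef, *_ = np.linalg.lstsq(design, x, rcond=None)
        resid = x - design @ coef
        return float(resid @ resid)

    rss_r, rss_f = rss(restricted), rss(full)
    q = P_instr.shape[1]                 # number of excluded instruments
    df = n - full.shape[1]               # residual degrees of freedom
    partial_r2 = (rss_r - rss_f) / rss_r
    f_stat = ((rss_r - rss_f) / q) / (rss_f / df)
    return partial_r2, f_stat

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 2000
    Z = rng.normal(size=(n, 2))
    P = rng.integers(0, 2, size=(n, 1)).astype(float)   # random assignment
    x = 0.3 * P[:, 0] + Z @ np.array([0.5, 0.5]) + rng.normal(size=n)
    print(first_stage_diagnostics(x, Z, P))
```

A commonly cited rule of thumb in the later literature, which is not stated in this paper, treats a first-stage F-statistic below about 10 as a warning sign of weak instruments.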

^Examining the partial R² and the F-statistic on the instruments in the first-stage regression
helps gauge the potential finite sample bias of IV relative to OLS (Bound et al., 1995). In fact,
there are useful guidelines for assessing whether or not the explanatory power of the instruments
in the first stage is adequate.

There are other techniques to address the problem of finite sample bias in cases where
the available instrument or instruments are fairly weak and no good alternatives are available.
The key to making such techniques work is to break the link between spurious correlations
between Pi and Xi and similar correlations between Pi and Yi. One such technique relies on split
sample estimates (see, e.g., Angrist and Krueger, 1995). In this approach, two samples are
drawn from a single population, ideally at random, and the first and second stages of the
instrumental variables analysis are carried out separately on those two independent samples. That
is, the first sample is used to estimate the relationship between Pi and Xi. Using the regression
coefficients from this analysis, a predicted value of Xj is estimated in the second sample, and the
outcome Yj is regressed on this predicted value in the second stage of the analysis, resulting in
an IV estimate of the effect of X on Y. Angrist and Krueger (1994) show that sampling variation
tends to bias this estimate toward zero, as Pi is never going to predict Xj as well as Xi and will
therefore introduce additional random error into the estimation of the relationship between Xj
and Yj. However, this is usually preferable to a situation in which the estimate is biased towards
the original biased OLS estimator, as is the case when finite sample bias is present. Currie and
Yelowitz (1997) apply this technique using data in a paper that explores the relationship between
living in public housing and child outcomes.
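
The split-sample procedure can be sketched for the simplest case of one mediator and one instrument. This is an illustrative outline under assumed names (`split_sample_iv`, `y`, `x`, `p`), not the estimator as implemented in the papers cited above, and it omits covariates and standard errors.

```python
import numpy as np

def split_sample_iv(y, x, p, seed=0):
    """Split-sample IV with one mediator x and one instrument p.
    The first stage is fit on one random half of the data; the second
    stage is run on the other half using the first half's coefficients."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    s1, s2 = idx[:half], idx[half:]
    # First stage on sample 1: x = a + b * p
    A1 = np.column_stack([np.ones(len(s1)), p[s1]])
    (a, b), *_ = np.linalg.lstsq(A1, x[s1], rcond=None)
    # Predicted mediator in sample 2, built from sample-1 coefficients
    x_hat = a + b * p[s2]
    # Second stage on sample 2: y = c + beta * x_hat
    A2 = np.column_stack([np.ones(len(s2)), x_hat])
    coef, *_ = np.linalg.lstsq(A2, y[s2], rcond=None)
    return float(coef[1])

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    n = 20000
    p = rng.integers(0, 2, size=n).astype(float)   # random assignment
    u = rng.normal(size=n)                         # unmeasured confounder
    x = 1.0 * p + u + rng.normal(size=n)
    y = 0.5 * x + u + rng.normal(size=n)           # assumed true effect: 0.5
    print(split_sample_iv(y, x, p))
```

Because chance correlations in the first half are independent of the outcomes in the second half, any remaining error in the predicted mediator acts like measurement error and pulls the estimate toward zero rather than toward the OLS estimate. The cost is that each stage uses only half of the data.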

Ultimately, given any finite sample, choosing instruments requires striking a delicate
balance between efficiency and bias. The best instruments are those that obtain IV estimates that
are asymptotically efficient and have small finite sample bias. Though randomized experiments
can provide a valid instrument that may yield consistent IV estimates, a different choice of
instrument will yield different estimates in any finite sample. Thus, there are risks to changing (or
increasing the number of) instruments in finite samples without adhering to some empirical (or
quantifiable) standard about the quality of the instrument. It is straightforward to determine the
quality of an instrument empirically in the case when only one endogenous variable is being
considered, but the risks associated with using weak instruments escalate when more than one
instrument is needed to identify an IV model.



Discussion and Conclusions 

For years, policymakers and researchers have grappled with developing empirical tech- 
niques to better understand important relationships between economic behavior and self- 
sufficiency and family or child well-being. Experiments offer the kind of exogenous variation 
that can help to empirically identify these relationships. There are experiments that occur natu- 
rally and those that we can create, and both have their virtues as well as problems. We argue 
that applying instrumental variables techniques to data from random assignment designs can be 
a powerful method for answering important policy questions. The goal in this chapter was to 
make the understanding and application of instrumental variables techniques accessible to a 
wide range of policymakers and researchers. 

The availability of data from numerous recent random assignment studies (e.g., of wel- 
fare and employment programs) provides a unique opportunity for researchers to dig into the 
black box and tackle difficult questions about how programs affect outcomes. IV estimates do
not on their own answer all policy-relevant questions but can provide policy-relevant estimates 
(i.e., local average treatment effects). 

The application of instrumental variables, however, should not be foolhardy. One of the 
benefits of using a randomly assigned treatment as an instrument is that many of the key as- 
sumptions of instrumental variables techniques are identified. Nonetheless, two of these as- 
sumptions — monotonicity and the exclusion restriction — must be carefully checked and ad- 
dressed. As the empirical examples in this paper show, it is frequently the case that more than 
one instrument is needed because programs have multiple goals and are likely to directly affect 
multiple outcomes, and it is almost never the case that even multigroup research design programs
produce substantially different effects on related outcomes (e.g., it is unusual to have an
experiment with a three-group research design in which one of the programs produces substan- 
tially different effects on a particular outcome relative to the other program). In addition, as 
shown in Box 2, potential violation of the monotonicity assumption implies that researchers 
need to consider carefully whether the random assignment design and resulting program effects 
are appropriate for answering the policy or research question of interest. 

While substantial progress has been made in understanding the assumptions underlying 
instrumental variables estimation, these assumptions are best understood under specific condi- 
tions, that is, when there is one policy variable of interest. Estimating multiple mediating 
pathways remains a key methodological challenge in instrumental variables analyses involving 
the currently available data from randomized experiments. Very few social policy experiments 
to date have included multiple randomly created “treatments,” and many that do were not de- 
signed to produce significant variation in a range of important policy variables. This leaves re- 
searchers with little choice but to create multiple instruments based on variation of program ef- 
fects across sites and subgroups. This approach is promising in some cases but can reintroduce 
bias that the instrumental variables procedure was designed to remove. It also relies on variation 
in program implementation across sites or subgroups, variation that often is not substantial 
enough to produce reliable estimates free of finite sample bias. 

In addition to expanding opportunities to learn from current data, instrumental variables 
estimation highlights important future research opportunities. Randomized experiments could 
be designed specifically to produce significant variation in key policy variables in such a way 
that unbiased estimates of the effects of those variables on key outcomes could be obtained. The 
focus in such studies would not be on the program effects per se but on the secondary effects of 
the mediators on the outcome. Combining the tools of instrumental variables and random assign- 
ment in such designs from the outset could dramatically improve the quality of instrumental vari- 
ables analyses based on random assignment. One hypothetical example is described in Box 4. Pol- 
icy researchers and program operators should work together to identify opportunities for such 
studies, which would be more useful than traditional random assignment experiments and more 
valid than traditional nonexperimental approaches to policy research. 



^Another condition in which IV assumptions are well understood (which we do not discuss here) is when
the underlying outcome of interest is linear. 

Box 4



Example of a Possible Future Experiment that, with IV,
Can Measure Causal Relationships

A key welfare policy question concerns the extent to which transitional Medicaid benefits affect 
the well-being of welfare leavers and their ability to remain off welfare. These benefits are ex- 
pensive to administer, and take-up is generally low. Policymakers are concerned that families 
who do not use transitional Medicaid are less likely to use preventive medical care and more 
likely to end up in emergency rooms. It is difficult to assess the effects of transitional Medicaid 
on families, because take-up of these benefits is selective. More advantaged individuals are more 
likely to know about the program, whereas less healthy individuals are more likely to find out 
about it because of an illness or hospital visit. 

A random assignment study with an IV component could be designed to assess the effectiveness 
of these transitional benefits. Such a study would use a so-called “encouragement design” in 
which we would not change or extend Medicaid benefits (which would be very expensive) but 
rather seek to increase awareness of existing program services among those already eligible, 
keeping track of those who were randomly targeted for additional information and help in access- 
ing benefits. Using instrumental variables, it would be straightforward to estimate the effect of 
the transitional Medicaid services from the experimental effect of the encouragement of its use
(provided that the effect is sufficiently large). 
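
For a single randomized encouragement, the IV estimate reduces to a simple ratio: the effect of encouragement on the outcome divided by the effect of encouragement on take-up. The sketch below shows that calculation; the function name `wald_estimate`, its argument names, and the toy data are hypothetical, and the ratio has a causal interpretation only if the IV assumptions discussed in this paper hold.

```python
import numpy as np

def wald_estimate(outcome, took_up, encouraged):
    """Ratio of the encouragement effect on the outcome to the
    encouragement effect on take-up of the benefit."""
    outcome = np.asarray(outcome, dtype=float)
    took_up = np.asarray(took_up, dtype=float)
    encouraged = np.asarray(encouraged, dtype=bool)
    # Experimental (intent-to-treat) effect on the outcome
    itt_outcome = outcome[encouraged].mean() - outcome[~encouraged].mean()
    # Experimental effect on take-up (the first stage)
    itt_takeup = took_up[encouraged].mean() - took_up[~encouraged].mean()
    if itt_takeup == 0:
        raise ValueError("Encouragement did not change take-up; the ratio is undefined.")
    return itt_outcome / itt_takeup

if __name__ == "__main__":
    # Toy data: eight families, four randomly encouraged.
    encouraged = [1, 1, 1, 1, 0, 0, 0, 0]
    took_up = [1, 1, 1, 0, 1, 0, 0, 0]
    visits = [2, 2, 2, 1, 2, 1, 1, 1]   # e.g., preventive care visits
    print(wald_estimate(visits, took_up, encouraged))
```

When the encouragement barely changes take-up, the denominator is close to zero and the ratio becomes unstable, which is the weak-instrument problem in its simplest form.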

The actual treatment in a study like this could have a tiered structure. (Using multiple tiers would
help in the encouragement design by providing multiple potential instruments.) For example,
Level I could be to simply send a random subset of eligible people in a county a letter as soon as
they become eligible, providing a phone number and explaining in multiple languages what the
program is like and how it could help. Level II would be to call or visit eligible families in order
to take a more active role in making sure they use the service. Level III would be to add an
ombudsman type of person, who would assist with eligibility determination, advocate with doctors
and dentists over acceptance of benefits, and help resolve other administrative problems that
come up in use of benefits. Obviously, the cost of administering these treatments would increase
with their extensiveness, but the cost would never include the prohibitive expense of actually
providing benefits.